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Abstract 

Cluster analysis of biological samples using gene expression measurements is a common task which 
aids the discovery of heterogeneous biological sub-populations having distinct mRNA profiles. Several 
model-based clustering algorithms have been proposed in which the distribution of gene expression 
values within each sub-group is assumed to be Gaussian. In the presence of noise and extreme ob- 
servations, a mixture of Gaussian densities may over-fit and overestimate the true number of clusters. 
Moreover, commonly used model-based clustering algorithms do not generally provide a mechanism 
to quantify the relative contribution of each gene to the final partitioning of the data. We propose a 
penalised mixture of Student's t distributions for model-based clustering and gene ranking. Together 
with a bootstrap procedure, the proposed approach provides a means for ranking genes according to 
their contributions to the clustering process. Experimental results show that the algorithm performs 
well comparably to traditional Gaussian mixtures in the presence of outliers and longer tailed distri- 
butions. The algorithm also identifies the true informative genes with high sensitivity, and achieves 
improved model selection. An illustrative application to breast cancer data is also presented which 
confirms established tumor subclasses. 



1 Introduction 



Microarray gene expressions studies are routinely carried out to measure the transcription levels of an 
organism's genes. A common aim in the analysis of expressions measurements observed in a population is 
the identification of naturally occurring sub-populations. In cancer studies, for instance, the identification 
of sub-groups of tumours having distinct mRNA profiles can help discover molecular fingerprints that will 



define subtypes of disease (Smolkin and Ghosh 2003 1 . 



Many different approaches have been suggested for partitioning biological samples, including hierar- 
chical clustering, K-means and probabilistic models based on finite mixtures of distributions. One of the 
widely recognised advantages of model-based clustering lies in the fact that it explicitly accounts for the 



experimental noise that is typical of microarray studies (Liu and Rattray 2010). The gene expression 



measurement within each cluster are modelled as random variables drawn from a group-specific probability 
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distribution, which is often taken to be Gaussian; several well-known parameter estimation algorithms exist 



and have been applied to gene expression data (Qu and Xu 2004 He et al. 2006 Liu and Rattray 20101. 

Despite the popularity of mixture models based on Gaussian components, this particular choice of 
probability distributions may not always be ideal, especially in the presence of high measurement noise. A 
mixture model that uses Gaussian components may suffer from a lack of robustness against outliers, which 
in turn may lead to an inflated number of detected clusters. This is because of the the fact that additional 



components are needed to capture the heavy tail distribution that characterise certain groups ( Qu and Xu 



2004 Melnykov and Maitra 2010 Liu and Rattray 2010). As an alternative to Gaussian components, 



Student t— distributions have been successfully used for robust model-based clustering in several application 



domains (Peel and McLachlan 2000 Liu and Rubin 1995), including the analysis of gene expression data 
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thus yielding heavier tails, and making it more robust to outliers or sampling errors (Kotz and Nadarajah 



2004 Peel and McLachlan 2000). 



The majority of algorithms for unsupervised data partitioning use all the variables to describe the 
objects to be clustered. However, in microarray studies, it is expected that not all the gene expression 
measurements will necessarily contribute equally to the identification of distinct sub-groups of samples. 
Even when real clusters exist and are well separated, is it often the case that only a subset of genes 
will have expression levels that significantly vary across groups. Failing to identify the truly informative 
gene expressions may yield inaccurate clustering results because the non-informative genes will mask the 
underlying structure of the data. The problem of sub-space clustering consists in detecting clusters that 
only exist in a reduced number of dimensions, and can be addressed as a variable selection problem. 
However, compared to supervised learning settings, such as regression and classification, variable selection 
in clustering is notoriously more difficult. 

Motivated by the challenges posed by the high-dimensional settings, recently there has been a burst of 
activity on variable selection methods applied to model-based clustering. Within the Bayesian estimation 



framework, most studies have adopted a specific prior that induces a sparse solution (Liu et al. 2003 



Tadesse et al. 2005 Yau and Holmes 2011). Maximum likelihood estimation approaches achieve variable 



selection imposing penalties constraints on the likelihood which has the effect of shrinking some parameters 



to common values ( Pan and Shen 2007 Xie et al. 2008 Wang and Zhu 2008 ) . Since these methods rely 
upon the normality assumption, their performance is expected to degrade as the level of noise increases, 
and generally they do not provide means for ranking the genes in order of clustering importance. 

In this work we propose a robust probabilistic clustering algorithm that simultaneously: (a) identifies 
the informative genes and ranks them by importance, and (b) discovers data clusters that emerge only when 
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considering the expression levels of the selected genes. Our approach consists of a penalised finite mixture 
of Student's t distribution for model-based clustering and gene selection. As noted above, the use of group- 
specific distributions having longer tails results in clustering assignment that are less prone to be affected 
by extreme or unusual observations, and facilitates the discovery of the true underlying number of clusters. 
Gene selection is achieved by imposing a penalty term on cluster-specific model parameters. Moreover, we 
propose a data resampling procedure to quantify the contribution of each gene to the clustering process 
and improve the inferential process of discovering the true number of clusters. 

The article is organised as follows. In Section 2 we introduce the proposed penalised finite mixture 
of Student's t distributions, derive an EM algorithm for parameter estimation, and present a bootstrap 
procedure that, when combined with the BIC criterion, enhances the model selection process. In Section 3 
we present results based on extensive Monte Carlo simulations which we use to assess the performance of 
the proposed methodology under different data generating mechanisms and in comparison to two competing 
algorithms. In Section 4 we discuss an application of our clustering framework to a breast cancer data set. 
We conclude with a discussion on potential directions of future research in Section 5. 



2 Penalised Finite Mixtures of ^-Distributions 
2.1 Model 

We observe p gene expression measurements on n independent samples. These values are collected in n 
vectors yi, . . . , y n , which are then arranged in an n x p data matrix y and have been standardised. Our 
first objective is to partition the n samples into naturally occurring groups. We assume that the density 
of y is a mixture of g components, one component for each cluster, 

n g 

(1) .^^E^/.fei .) 

j=l i=l 

where \I> = {tt, @i, . . . , g } is the parameter set containing the mixing coefficients tv = (m, . . . , ir g ), 
summing up to one, and the parameter sets ©i, . . . , & g that fully specify the mixture components. 

Furthermore, we assume that each component of the mixture follows a multivariate Student t— distribution 
having density 

f( . , A= r((^ + P )/2)|s|-v 2 

J[m ^ ' ' r(i//2)(7ri/)P/ 2 {l + ( 5(y; M , 53)/i,}("+p)/2 ' 
where T(-) is the gamma function and S(y; (i, 53) = (y—fJ,) T 'E~ 1 (y—fi) is the Mahalanobis squared distance. 

In this case, each 0^ includes a p-dimcnsional location vector /x^, a p x p covariance matrix 53^ and the 

vector of degrees of freedom Vi. The log-likelihood function can then be written as 

n r / g \ 

(2) \ogL(*) = J2 1 og(£Ti/i(VjlMi,S i ,i/ j ) 

j=l L \i=l J 
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and the inferential problem consists in estimating \I/ from the data. In the following derivations we assume 
a diagonal covariance matrix, = diag(af 1; . . . , cf p ) with <r^ = {cr^i, • ■ • , Cj,p} the p-dimensional vector 
of standard deviations. This assumption makes the algorithm scale well to very high dimensional setting 



without hindering its ability to select the truly informative variables; see for instance Xie et at. (2008). Our 
second objective is to identify the gene expressions that are relevant or informative for the identification 
of the g clusters. 

We embrace a penalised log-likelihood framework, in which the log-likelihood function ^ is replaced 

by 

(3) log L p (¥) = log L(¥)-fc A (®i. •••.»*) 

where h(-) is a penalty function that depends on the parameters of the mixture components as well as a 
regularisation parameter vector A. The effect of the penalty is that of shrinking the estimated parameters 
of the uninformative variables towards a single value that is common across all mixture components. 

For finite mixtures of Gaussian distributions, the maximisation of ([3]) under a penalty on the Li-norm 



of the location and scale parameters has been proved to identify uninformative gene expressions (Pan and 



Shen 2007 Xie et al. 2008[ ). In this study we adopt a similar penalty term for our model ([2]), that is 



(4) h x = E M + A - E E i lo s <i\ 

i = l d=l i=l d=l 

where X = (A^, X a ). According to this penalty, the log- likelihood |2| has a penalisation that is proportional 
to the absolute values of the means and log- variances, where € M. + and \ a € M + are the regularisation 
parameters for means and variances, respectively. 

The model is first re-written in a missing data framework, which facilitates parameter estimation 



(Mclachlan and Peel 2000). A g-dimensional component-label vector Zj — (Zji, ■ ■ ■ , Zj g ) is introduced to 
indicate the cluster membership. With probability m, each value Zi j is equal to one if the observation yj 
belongs to component i and zero otherwise. Then Zj follows a multinomial distribution with parameters 
(jti, . . . , n g ). We adopt a hierarchical representation of the Student's t distribution. Conditional on Zjj = 1, 
Uj follows a t distribution. Introducing a second latent variable Uj , and using a hierarchical representation, 
we write 

(5) u M ~Gamma^.^j //, N i^fh. ^' 

where Gamma(-, •) and N(-, •) are the gamma and the multivariate normal density functions, respectively. 
The complete data becomes y c = {yj, Zj, %}, and the complete data log- likelihood function can then be 
factorized as 
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(6) logLp(¥) = J 1 (ff) + Za(f) + i3(M.E)-/iA(e) 

where 

g n 

'l( 7r ) = EE^ iog 71 "* 

i=l 3=1 
a n 

= E E lo s r (|) + 2 logr( f ) 

i=l j=i 



9 n ^ 

is(M. s ) = E E z «{-f lo g( 2?r ) + 1 lo sK) - 9 lo § i s « 



1=1 J=l 

- 2 u j - Mi) T E< _1 (i/j - mO} 

and /ia(©) is as given in Q. 
2.2 The EM algorithm 

Maximum likelihood estimation (MLE) is carried out by determining the parameter values for \& that 
maximise ([3]). As commonly done in similar settings, we develop an expectation-maximization (EM) 
algorithm ( |Dempster el dl. 1977). At each iteration k the algorithm first evaluates the expectation of the 



penalised log-likelihood function of the complete data conditional on 1 4'( fe_1 ) and then updates the MLE 
of , until convergence to a local maximum. 

The conditional expectation of the complete data taken with respect to the posterior probability of the 
latent variables is 

(7) QpC*!*^- 1 )) = {log L c (¥) I y} - h x (&) 

where E^,(fc-i){logL c ( x l/)| y} is the conditional expectation of the log-likelihood. It can be noted that the 
penalisation term h\(&) does not depend on any latent variable. Using ([6]), the conditional expectation 
of the log-likelihood can be decomposed as 

Q^nl^-V) + Q 2 H* (fc-1) ) + Qs(M, E|* (fc_1) ). 

At iteration k 7 in the E-step the expected values of the latent variables are derived from their condi- 
tional posterior distribution given v|/( fc ) . The conditional expectation of the indicator variable z, that is 
E^o-ij (zijlvj), is given by 

(fe-i) , / I (fe-i) ^,(fc-i) (fc-i)\ 

n h{v 3 m > s » A ) = (fe) 



() 



whereas the conditional expected value of the scaling factor, is given by 

E*(*-D (ttjl yj.^j = 1) = (fc -i) ^ . (Ji) ( fe -i), = U S- • 
Finally, the conditional expected value of the log precision factor is obtained as 

E^(fc-i) (bgujltf,-,^ = 1) = logujj + |^ 2 +P ) - lo S P 2 +P ) } 

where is the digamma distribution, ip(s) — {dT(s)/ds}/T(s). 

In the M-step the updated estimate of * is found by maximising (M) given ^C*- 1 ) and given the 
expected values of the latent variables computed in the E-step, that is 

(8) * (fe) = argmax g P (*|* (fc - 1) ) 

i> 

Since all terms in ^ are additive and depend on different parameters, we can solve ^ by maximising 
each term separately. The new estimated value of 7r is the root of the derivative of Qi(n\^^) with respect 
to 7r. Using a Lagrange multiplier to enforce the constraint X)f=i n i = 1) the update for tv{ is 

n 

(fc) (fc) , 

i=i 

The term Q 2 (f| 1 4' ( - fc ~ 1 - ) ) is a function of the degrees of freedom. No closed form solution is available, but 
the first derivative with respect to vi is smooth enough to have a straightforward numerical solution that 
can be found by any standard optimisation algorithm. In our implementation, we use the PORT routines 
in the R package nlminb. 

The third term Qs(fJ., S|\&( fc-1 )) is the only one depending on parameters that are subject to regulari- 
sation. In this case we set up a constrained maximisation problem that accounts for the relevant penalty 

term. First, an update for the penalised mean vector fj, is then obtained by finding the maximum of 

an g p 

i=l j=l i=l d=l 

Despite this being differentiable with respect to fm only when fj,^ ^ 0, we can still set the derivative to 
zero and solve: 

u [k) 

(9) T iJ ~ ^ ~ A ^ Sign (W.d) = 

j — 1 ^i,d 

while for the singular case where & = the following inequality holds true: 



(io) siUi^SWi < v 

a i,d 

Combining ^ and ( 10 1 the updating algorithm for /Xi becomes 



r (*0 „(*0 1 
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where it can be seen that the unpenalised MLE of the mean, that is 



r n T [k) u {k) v- 



(fe) (fe) 



is shrunk towards zero by an amount that increases with A M and is proportional to the variance scaled by the 
precision factor. When A M is sufficiently large, collapses to zero thus making variable yd uninformative. 
In an analogous way, the update for S is found by maximizing: 

g n g P 

E E <to(=«i* (fc - 1) ) -x^i: i log -Li 

i=l j = l i=l d=l 

for which is differentiable everywhere except for a^d = 1. When cr^^ ^ 1 its derivative with respect to 
<?i,d is 



(11) 



E' 

3=1 



-<*> L 

W 2(7 



u li ) (Vj,d ~ Mi,d) 2 \ A <r sign (log of jd ) 



Z , d 



2 a? 



'i,d 



while for a i c \ = 1 we have 
(12) 



E 



1 , u\j (Vj,d - fi>i,d? 


2 2 



< A(j . 



The final update is obtained combining (11)-(12), which gives the following updates for <r i y , 

sign(|6, - Ci\ - Xfj ) _|_ + 1 



2( fe ) 



1 + A CT sign(ci - 6i)/6, 



1 



where crf^ = c^/b^ is the unpenalized maximum likelihood estimate of the standard deviation with 

n n 

^=E<?A =^rg)ug(y,-Mf ) ) 2 /2 
3=1 i=i 

It can be noted that, when X a is sufficiently large, tr^d is set to be one, thus making the d th variable 
uninformative . 



2.3 Bootstrap Strategies for Model Selection 

Parameter estimation using the EM algorithm is carried out for a fixed number of components, g, and a 
fixed penalty vector A controlling how many variables are retained as informative. The selected variables, 
for a given value of A, are collected in the set S\ C {1, . . . ,p} having cardinality m. 

Both the optimal number of mixture components and level of penalisation can be found by exploring a 
finite number of solutions guided by the Bayesian Information Criterion (BIC). In our penalised likelihood 
framework, we use a modified version, BIC = — 2 log Lp(\&) + r log(n), where n is the number of samples 
and r = g — 1 + 2 g m + g — q is the effective number of parameters once the q — p — m non-informative 
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variables are excluded from the model (Pan and Shen 2007). However, the BIC criterion does not always 



lead to the correct choice of the best model (Baek and McLachlan 20111. In our experience, when m is 
very small compared to p and when the densities are fat tailed we find that this criterion still prefers too 
complex models as some degree of over-fitting still takes place. We therefore propose a bootstrap approach 
that is similar to the stability selection procedure originally developed for variable selection in penalised 



regression ( Meinshausen and Biihlmann 2010). This procedure enhances model selection, but also provides 
a way to rank the selected variables. 

Initially we assume that g is fixed. For a given g, we are interested in selecting an optimal set of 
informative genes, which should be ranked in decreasing order of importance. We search for a value of A 
that minimises the modified BIC criterion, and call this optimal value A*. This search can be carried out 
using a grid of candidate points. Then, B sub-samples of the data are randomly drawn with replacement, 
{y^yf—i, ah having size [ti/2]. For each random sub-sample y^ b ', we fit the penalised mixture model using 
the EM algorithm with the given number of components g and regularisation A*. The set of variables 
selected in each sub-sample is called S\ * . 

An indicator variable Id(Si)) is introduced which equals 1 if the variable d has been flagged as infor- 
mative for y^ h \ and zero otherwise. The selection probability for gene d is then estimated as 

TT d = — , d = l,2,...,p. 

It should be noted that, whereas a single model fit obtained with the EM algorithm using the whole 
data set would only provide a binary indicator labelling each variable as informative or not, the selection 
probabilities provide a useful metric to assess the relative importance of each gene. This is with regard 
to both clustering as well as for ranking purposes. All the variables having a sufficiently high selection 
probability are then deemed informative. A threshold on tt^ could be selected to control the number of false 



positives, as in Meinshausen and Biihlmann (20101, although little theoretical developments are available 
yet. 

Apart from enabling to rank genes, the resampling approach provides the means to improve upon the 
model selection process. In order to estimate the correct number of mixture components, g, it is common 
routine to compare a series of models, each one having an increasing number of components, say from 2 
to fc ma x7 and select the model with the smallest BIC. However, when the ratio of non-informative over 
informative genes is high, we have found that in practice the modified BIC still tends to overestimate the 
number of clusters. We address this issue by proposing the following two-step procedure. For each one of 
the k max — 1 models being compared, we carry out the bootstrap procedure as described above, and collect 
in a set of cardinality m the informative variables selected by all models over all bootstrap replicates. In 
a second step, we re-fit all competing models, but instead of using all the p genes, we use only the m 



informative genes, where m is usually much smaller than p. The selected model is the one that minimises 
the BIC, as usual. By initially removing the non-informative variables, and therefore the amount of noise, 
this simple approach reduces the bias towards selecting more complex models, and improves upon the 
selection of the number of components, as shown in Section [3} 



3 Simulation Experiments 

The gene ranking and model selection procedures have been assessed using extensive Monte Carlo simula- 
tions. We assume a sample size of n = 200, and m = 20 variables informative for clustering. The number 
of uninformative variables, q, is always taken to be much higher than m. The informative variables are 
sampled from a mixture of g multivariate Student's t distributions. All the uninformative variables share 
the same parameters and also follow a Student's t distribution. 

In order to explore the effects of having fatter tails on both clustering and variable selection performance, 
we consider two scenarios: a low degrees of freedom case (Low DoF), in which the distributions have tails 
that are more pronounced compared to multivariate Gaussians, and a high degrees of freedom case (High 
DoF), that is approximately Gaussian. We simulate data with both two and tree components. When 
g = 2, the parameters of the densities are chosen to ensure that there is roughly a 30% overlap between 
them; when g = 3, which we indicate as A, B and C, the parameters are chosen so that there is about 25% 
overlap between A and B, 30% overlap between A and C, and around 5% overlap between B and C. In a 
separate experiment we specifically assess the effects of increasing the number of uninformative variables, 
and consider two cases: q = 200 (low noise) and q — 2000 (high noise), while still keeping the number of 
informative variables fixed at m = 20. 

We employ three performance indicators. The clustering performance, that is the ability to identify 



the correct clusters, is assessed using the Adjusted Rand Index (AM) (Hubert and Arabie 1985), which 
measures the proximity between any two partitions of the samples, and therefore quantifies how close 
the estimated cluster memberships are from the ground truth. This index ranges between (random 
assignment) and 1 (perfect matching). The ability of the model to identify truly informative variables is 
summarised by the sensitivity index which is computed as the ratio of informative variables that have been 
correctly selected by the algorithm (true positives) over m, the informative variables selected by the model. 
In order to quantify the ability of the model to exclude truly uninformative variables, we also compute 
a specificity index as the ratio between the number of truly negative variables over q, the uninformative 
variables selected by the model. 

We compare the performance of the suggested penalised Student t— mixture model (PTM) with two 
competing clustering methods that simultaneously partition the samples and identify the relevant genes: 
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the penalised Gaussian mixture (PGM) ( Xie et al. 2008 ) which has been implemented in R, and the sparse 



if -means algorithm (PKM) (Witten and Tibshirani 20101, which uses a lasso-type penalty to select the 
variables and obtain a sparse hierarchical clustering, as implemented in the R package sparcl. 

For each setting being considered, we generate 100 independent data sets and report on Monte Carlo 
averages. In Table [l] we consider two and three clusters, and fit the three competing models. In all cases, 
the correct number of clusters is pre-specified and only the number of selected variables is learnt from the 
data using model selection. The modified BIC criterion is used to select the degree of penalty in both 
PGM and PTM models, without using the bootstrap procedure, which is evaluated separately later. For 
PKM, we use the built-in model selection procedures that rely on multiple permutations of the data. All 
methods are assessed using ARI, sensitivity and specificity indexes. 



.9 = 2 .g = 3 

DoF Model ARI Sens Spec ARI Sens Spec 

PKM 0.33 0.38 1.00 0.10 0.20 1.00 
Low PGM 0.57 0.60 0.96 0.40 0.70 0.98 
PTM 0.72 0.76 0.97 0.56 1.00 0.96 

PKM 0.62 0.66 1.00 0.37 0.77 1.00 
High PGM 1.00 1.00 1.00 0.71 1.00 0.99 
PTM 1.00 1.00 1.00 0.71 1.00 0.99 

Table 1: Performance assessment of three competing sparse clustering methods. Data simulated with parameters 
n = 200, m = 20, and q — 2000. The correct number of mixture components is assumed known, and variable 
selection is performed for each simulated data set. 



In the low degrees of freedom and g = 2 case, PTM achieves the highest sensitivity index at the cost of a 
marginally lower specificity compared to the other two models. This ability to identify and retain the truly 
informative variables translates into the highest average ARI for PTM. As expected, PKM performs poorly 
in this case as no probabilistic model is assumed, and the model is more sensitive to extreme observations. 
In the g = 3 case, the specificity of all three competing models is still comparable, but PTM achieves the 
highest sensitivity that leads to the best clustering performance. When the distributions are Gaussian, 
both PGM and PTM have similar performances, as expected in this case, whereas performance of PKM is 
lower, especially with three clusters. 

The potential improvements that can be gained from the resampling procedure of Section [2.3| in terms 
of both variable selection and clustering performance, are summarised in Table [2] Here we consider four 



11 





Low 


DoF 


Hi£ 




DoF 


Bootstrap 


q = 200 


q = 2000 


q = 200 




q = 2000 
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0.88 




0.82 
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0.92 


0.86 


0.98 




0.93 



Table 2: ARI for the PTM model, with and without bootstrap. Data simulated with parameters g — 3, n = 
200, m = 20. 



scenarios whereby we vary the distribution used to generate the data within each cluster as well as the 
number of uninformative variables. The data are sampled from a mixture of three multivariate Student 
t— distributions. The sample size is n = 200, with m — 20 informative variables while the number of 
uninformative variables q is taken to be both 200 and 2000. After running the bootstrap procedure with 
a fixed probability selection threshold, tt = 0.7, the improved ability of the model to exclude the noise 
variables improves the clustering performance, as quantified by the ARI. The improvement is particularly 
remarkable when the distributions have longer tails. 





Low DoF 


Hig 


h DoF 


Bootstrap 


q = 200 q= 2000 


q = 200 


q = 2000 


No 


0.5 (3.7) 0.0 (4.7) 


0.72 (3.3) 


0.55 (3.75) 


Yes 


0.85 (2.85) 0.3 (2.35) 


1.00 (3) 


1.00 (3) 



Table 3: Percentage of correctly identified mixture components in the PTM model, with and without bootstrap. 
Data simulated with parameters g = 3, n = 200, m = 20. The average number of clusters is in brackets. 



In Table [3] we explore the effects of the two-step bootstrap approach described in Section 2__3 for the 
selection of the number of mixture components. For this experiment, we use the simulated data sets used 
to produce the results of Table [2j where the true number of clusters is g = 3. For each scenario being 
considered, we search among models having up to five clusters. We report on the percentage of times the 
correct number of clusters is selected by the two strategies, with and without bootstrap. As expected, both 
model selection strategies perform better when the distribution of the simulated variables is Gaussian. 
In both high and low degrees of freedom scenarios, there is a notable performance gain when using the 
resampling scheme especially so in particularly demanding conditions when the data have fatter tails and 
the number of noise variables is high. We note also that in all scenarios the average number g selected, 
quoted in brackets in Table [3j is higher for the first strategy. This evidence confirms that, by reducing the 
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number of noise variables, the bootstrap approach alleviates the overfitting problem which is particularly 
important in high dimensional settings. 



3.1 Application to Breast Cancer Data 



The proposed sparse mixture model was used for the analysis of a publicly available breast cancer data 
set consisting of n = 128 early-stage tumors from those collected at Nottingham City Hospital NHS Trust 



between 1986 and 1992 (Naderi et al. 2007 Blcnkiron et al. 2007). This cohort of tumours is representative 



of the demographics of breast cancer and the majority of patients were post-menopausal. Microarray 
data recording the expression levels of p = 36, 939 probes are available in the ArrayExpress database 
(www.ebi.ac.uk/arrayexpress) under accession number E-TABM-576. Our analysis was unsupervised and 
aimed at simultaneously detecting clusters and ranking genes based on their contribution to the cluster 
assignments. 

Following the model selection procedure detailed in Section [2j we inferred the number of clusters by 
exploring up to five possible clusters. For this data set, the minimum modified BIC supports a 3-component 
mixture. The resampling approach was also used to estimate the selection probability of each probe. The 
heatmap of Fig. [I] shows the 1128 probes with selection probability at least as large as 0.7 (about 3% 
of the total), and their expression patters across the 128 tumors. In order to interpret these clustering 
allocations, we explored the correlation with the clinical sub-groups induced by the Estrogen Receptor (ER). 
Approximately 80% of human breast carcinomas present an estrogen receptor a-positive (ER+) disease, 
with ER+ breast cancers responding well to therapies and ER- negative tumours being more resistant. 
ER status is an essential determinant of clinical and biological behaviour of human breast cancers and 
it is well known that the major molecular features of breast cancer segregate differently according to ER 



status (Schneider et al.. 20061. With g = 2, our analysis finds clusters that overlap with ER-positive and 



ER- negative tumours; these results are consistent with previous findings on independent data sets (Van 't 



Veer et al. 2002 for instance). In this case, 82.5% of all ER- tumors fall into cluster A and the remaining 
ones into cluster B. When an additional cluster is added (g = 3), we observe a split of the ER+ dominated 
cluster B into two groups: 81% of the samples in cluster A are still ER-, whereas the majority of samples 
in B and C arc ER+ (88% and 76%, respectively). The ER status as well as the cluster assignments for 
both g = 2 and g = 3 are reported at the top row of FigjT] 

Several genes that have been selected with high selection probability by PTM had been previously 



included in independent ER signatures (Abba et al.\ 2005| Thakkar et al. 2010). These genes include (with 



selection probabilities): FOXA1 (0.935), NTN4 (0.906), CSNK1A1 (0.892), STC2 (0.885), SF3B3 (0.863), 
MMP12 (0.860), ITGB7 (0.845), CA12 (0.831), MLPH (0.827), GATA3 (0.824), NAT1 (0.802), GABRP 
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Figure 1: Heatmap showing the expression levels in 1128 the top-ranked genes (rows) in the 128 tumors (columns). 



(0.781), DUSP4 (0.745), NUTF2 (0.777), XBP1 (0.756), SLC43A3 (0.752), PLAT (0.727), ESR1 (0.727) 
and AARS (0.713). Among the genes with highest selection probability, FOXA1 has very recently been 



found to be a key determinant of ER function and endocrine response (Hurtado et al. 2011). Moreover, 



FOXAl, GATA3 and ESR1 have also been found to be associated to ER status in an analysis of invasive 



ductal carcinoma (Schneider et al. 2006). Genetic variants immediately upstream of ESR1 have been 



linked to breast cancer risk. Very recently, Dunbier et al. (2011) found that three open reading frames 



within this region are tightly co-expressed with ESR1, and investigated the function of these three genes: 
C60RF97, C60RF96 and C60RF211. Their findings suggest that the genes could contribute to the 
phenotype associated with ER positivity. In addition, they may be involved in the mechanism by which 
genetic variation in this region of the genome contributes to breast cancer susceptibility. These three 
genes have also been found to have high selection probabilities in our ranking (0.770, 0.713 and 0.709, 
respectively) . 

We further investigate whether the three clusters may indicate clinically distinct subgroups of patients. 
We report on the frequency distribution of the Nottingham Prognostic Index (NPI) in the clusters detected 
by the mixture model, using both two and three mixtures components. This index uses pathological 
variables, such as tumor size and grade, to generate a prognostic score for each patient that is predictive 
of outcome (Callagy et al. 2006). When using two components (see Fig[2|, the NPI present a different 
distribution in the two clusters, which are representative of ER+ and ER- tumors. When using three 
mixture components (see Fig p3j) , we find that patients in subgroup C have a statistically significantly 
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higher NPI index compared to those in B, and are more similar to the ER- dominated cluster A, for which 
response to therapy is known to be less favourable. 

NPI by Cluster 




Cluster 



Figure 2: NPI by cluster using two components. 



NPI by Cluster 




B C 
Cluster 



Figure 3: NPI by cluster using three components. 



Different studies have tried to identify breast carcinomas subtypes beyond the simple classification 



15 



between positive and negative ER status (Sorlie et al. 2001 Calza et al. 2006 Sotiriou and Pusztai 



2009 Nicolau et al. 2011). The subtypes varies across studies and there is no clear consensus on the 



gene expression signatures that identify each group. In order to better interpret our results, we have 
compared our proposed clustering based on the genes selected by our penalised mixture model (PTM) 



to the clustering obtained by using only the 87 genes identified as informative by Calza et al. (2006) in 



their study of 412 breast cancers patients from Stockholm and Uppsala, Sweden. Their gene expression 



signatures were originally proposed by Sorlie et al. (2003) and consisted of approximately 500 genes whose 



expression varied the least in successive samples from the same patient's tumor but which showed the most 
variation among tumors of different patients. Genes were median-centered and subtypes were isolated using 
an average-linkage hierarchical clustering algorithm based on uncentered correlation distance metric. 




CXCL1 
FLJ14525 
ZNF532 
FZD7 
CDK6 
FLJ12438 
TRIM29 
CDK2AP1 
DSC2 
PLSCR1 
PRAME 
CHI3L2 
VGLL1 
FOXC1 



Figure 4: The Basal signature of 



Calza et al. 



{ 2006 1 applied to the Nottingham City Hospital NHS Trust cohort 



identifies a Basal-like subtype that overlaps with Cluster A obtained by PTM. 



The heatmap in Figure [4] shows the gene expression patterns in our cohort of 128 samples using the 



Basal- like signature provided by Calza et al. ( 2006 ) . Our cluster A contains all the samples in which the 



marker genes are highly expressed, and is characterised by an higher percentage of ER- cases, as expected 
for this subtype. The clinical prognosis associated to cluster A is also in line with what has been previously 
reported in the literature whereby patients of this group show an above average tumor size, higher frequency 
of grade three carcinomas and are less responsive to therapy. 



4 Discussion and Conclusions 

In this work we have proposed a penalised finite mixture of Student t— distributions which performs gene 
selection and data clustering in a penalised maximum-likelihood estimation framework, fitted via the EM 
algorithm. A novel bootstrap procedure for model selection and variable ranking has also been developed. 
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Using simulated data under various different settings, we have shown that the proposed methods are 
expected to be particularly beneficial in the presence of many uninformative gene expressions and fat 
tailed distributions. 

The algorithm has been applied for the analysis of a large data set consisting of breast cancer tumors. 
Our results confirm the importance of top ranked genes in the their role of differentiating between ER+ 
and ER- samples. Moreover, there is an indication that the selected genes may be important to define a 
clinically important subtype of E+ enriched tumors, which warrants further investigation. 

As an extension to this work, we can consider Bayesian estimation. As is consistent with the literature on 
the lasso, one can easily place a Bayesian interpretation on our penalised mixture of Student t— distributions. 
Given the natural probabilistic nature of selecting g in the Bayesian framework, it is of interest to develop 
advanced simulation-based estimation techniques that are required for our model. 
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